Business Understanding

Project summary

According to Epsilon research, 80% of customers are more likely to do business with a company if it provides personalized service. Banking is no exception. The digitalization of everyday life means that customers expect services to be delivered in a personalized and timely manner, often before they have even realized they need them.

Santander Group aims to go a step beyond providing a customer with a financial service and intends to determine the amount or value of the customer’s transaction. The primary focus is on Digital Delivery of Financial Services, reinforcing distribution and the overall customer experience in the new digital environment. The company strives to achieve this by using big data analytics platforms that leverage financial and non-financial information. That means anticipating customer needs in a more concrete, yet simple and personal, way.

With so many choices for financial services, this need is greater now than ever before. This is a first step that Santander strives to nail in order to personalize their services at scale.

Examples of Analytics in Banking

A US bank used machine learning to study the discounts its private bankers were offering to customers. Bankers claimed that they offered discounts only to valuable customers and more than made up for them with other, high-margin business. The analytics showed something different: patterns of unnecessary discounts that could easily be corrected. After the unit adopted the changes, revenues rose by 8% within a few months.

A top consumer bank in Asia enjoyed a large market share but lagged behind its competitors in products per customer. It used advanced analytics to explore several sets of big data: customer demographics & key characteristics, products held, credit-card statements, transaction & point-of-sale data, online and mobile transfers & payments, and credit-bureau data. The bank discovered unsuspected similarities that allowed it to define 15,000 microsegments in its customer base. It then built a next-product-to-buy model that increased the likelihood of buying three times over.

Project objective and data mining problem definition

Project Objectives:

  • Anticipate customer needs
  • Provide personalized banking
  • Identify value of transactions for each potential customer

Santander Group wants to predict the value of future customer transactions (the target column) in the test set with minimal error. The evaluation metric for this project is Root Mean Squared Logarithmic Error (RMSLE).
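
Concretely, RMSLE = sqrt(mean((log(1 + predicted) − log(1 + actual))^2)). A minimal R implementation for later model comparison (the function name `rmsle` is ours, not part of the competition kit):

```r
# Root Mean Squared Logarithmic Error, the competition metric.
# log1p(x) = log(1 + x) keeps the metric defined when values are zero.
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# A perfect prediction yields an error of exactly zero:
rmsle(c(30000, 600000), c(30000, 600000))
```

Because the error is taken on the log scale, RMSLE penalizes relative rather than absolute deviations, which suits a target spanning several orders of magnitude.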

To solve this challenge, we plan to follow the CRISP-DM outline:

  1. Perform preprocessing and EDA.
  2. Validate the provided partition of the available data into a training set and a test set.
  3. Build a data mining model using the training set data. We plan to use two supervised regression methods (a regression decision tree and linear regression) and compare the results.
  4. Evaluate the data mining models using the test set data and select the method with the lowest error.
  5. Determine whether all the facets of the objective have been addressed or there are subtler interesting areas.
  6. Make conclusions about the model results and produce the report for deployment.

Data Understanding & Analysis

Data at a glance

We are provided with an anonymized dataset containing numeric feature variables, the numeric target column, and a string ID column.

File descriptions:

  • train.csv - the training set;
  • test.csv - the test set.
# Load the training and test sets
transaction.data <- read.csv(file = "train.csv", header = TRUE, sep = ",")
test_non_zero_base <- read.csv("test.csv", header = TRUE, sep = ",")
# Attach both frames so columns (e.g. `target`) can be referenced directly;
# note that attach() can mask same-named columns, so it should be used with care
attach(transaction.data)
attach(test_non_zero_base)
# Suppress scientific notation in printed summaries
options("scipen" = 999, "digits" = 10)
# Helper to control plot dimensions in the notebook output
set_plot_dimensions <- function(width_choice, height_choice) {
        options(repr.plot.width = width_choice, repr.plot.height = height_choice)
        }
str(transaction.data, list.len = 10, vec.len = 5)
## 'data.frame':    4459 obs. of  4993 variables:
##  $ ID        : Factor w/ 4459 levels "000d6aaf2","000fbd867",..: 1 2 3 4 5 6 7 8 9 10 11 12 ...
##  $ target    : num  38000000 600000 10000000 2000000 14400000 2800000 164000 600000 979000 460000 1100000 16000000 ...
##  $ X48df886f9: num  0 0 0 0 0 0 0 0 0 0 0 0 ...
##  $ X0deb4b6a8: int  0 0 0 0 0 0 0 0 0 0 0 0 ...
##  $ X34b15f335: num  0 0 0 0 0 0 0 0 0 0 0 0 ...
##  $ a8cb14b00 : int  0 0 0 0 0 0 0 0 0 0 0 0 ...
##  $ X2f0771a37: int  0 0 0 0 0 0 0 0 0 0 0 0 ...
##  $ X30347e683: int  0 0 0 0 0 0 0 0 0 0 0 0 ...
##  $ d08d1fbe3 : int  0 0 0 0 0 0 0 0 0 0 0 0 ...
##  $ X6ee66e115: int  0 0 0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
summary <- summary.data.frame(transaction.data)
summary[1:6, 1:10]
##          ID           target           X48df886f9         
##  000d6aaf2:   1   Min.   :   30000   Min.   :       0.00  
##  000fbd867:   1   1st Qu.:  600000   1st Qu.:       0.00  
##  0027d6b71:   1   Median : 2260000   Median :       0.00  
##  0028cbf45:   1   Mean   : 5944923   Mean   :   14654.93  
##  002a68644:   1   3rd Qu.: 8000000   3rd Qu.:       0.00  
##  002dbeb22:   1   Max.   :40000000   Max.   :20000000.00  
##    X0deb4b6a8            X34b15f335            a8cb14b00           
##  Min.   :      0.000   Min.   :       0.00   Min.   :       0.000  
##  1st Qu.:      0.000   1st Qu.:       0.00   1st Qu.:       0.000  
##  Median :      0.000   Median :       0.00   Median :       0.000  
##  Mean   :   1390.895   Mean   :   26722.45   Mean   :    4530.164  
##  3rd Qu.:      0.000   3rd Qu.:       0.00   3rd Qu.:       0.000  
##  Max.   :4000000.000   Max.   :20000000.00   Max.   :14800000.000  
##    X2f0771a37             X30347e683            d08d1fbe3          
##  Min.   :        0.00   Min.   :       0.00   Min.   :       0.00  
##  1st Qu.:        0.00   1st Qu.:       0.00   1st Qu.:       0.00  
##  Median :        0.00   Median :       0.00   Median :       0.00  
##  Mean   :    26409.96   Mean   :   30708.11   Mean   :   16865.22  
##  3rd Qu.:        0.00   3rd Qu.:       0.00   3rd Qu.:       0.00  
##  Max.   :100000000.00   Max.   :20708000.00   Max.   :40000000.00  
##    X6ee66e115          
##  Min.   :       0.000  
##  1st Qu.:       0.000  
##  Median :       0.000  
##  Mean   :    4669.208  
##  3rd Qu.:       0.000  
##  Max.   :10400000.000

Preliminary observations:

  1. Time-series nature - the dataset appears to be a time series in both dimensions, row-wise and column-wise.

  2. Disguised meaning of the columns - each column seems to represent individual transaction amounts, possibly related to different transaction types.

Exploratory Data Analysis

library(DataExplorer)
library(ggplot2)
library(data.table)
library(dplyr)
library(plotly)
library(e1071)
library(tidyr)
library(purrr)
library(compare)
library(tidyverse)
library(caret)
library(leaps)
library(MASS)
library(mltools)
library(psych)
library(rpart)
library(rpart.plot)
library(arules)
library(arulesViz)
library(knitr)
library(randomForest)

First, we want to assess the data quality in terms of missing values and take a closer look at the target variable, its distribution, and summary statistics.

#plot_missing(transaction.data)
#transaction.data[!complete.cases(transaction.data),]
#sapply(transaction.data, function(x) sum(is.na(x)))
#Due to the size of the data set, commands above are difficult to print in the report
sum(is.na(transaction.data))
## [1] 0
ggplot(transaction.data,aes(x=target))+geom_histogram(fill="blue",bins=50)+scale_x_continuous(trans='log2')+ggtitle("Histogram Distribution of Target")

box_plot <- ggplot(transaction.data, aes(y= target)) + 
  geom_boxplot() + 
  ylab("Target") +
  scale_y_continuous(trans='log2')+
  ggtitle("Box Plot of Target")
box_plot

qqnorm(transaction.data$target,
      datax = TRUE,
      col = "red",
      main = "Normal Q-Q Plot of target Distribution")
qqline(transaction.data$target,
      col = "blue",
      datax = TRUE)

(min(target))
## [1] 30000
(max(target))
## [1] 40000000
(target_lcutoff <- quantile(target,.25))
##    25% 
## 600000
(target_ucutoff <- quantile(target,.75))
##     75% 
## 8000000
(median(target))
## [1] 2260000
(mean(target))
## [1] 5944923.322

As we can see, there are no missing values. The target variable is not normally distributed and has several outliers that we will need to pay attention to during the Data Preparation stage. The mean is higher than the median, so the distribution is right-skewed. Also, judging by the min and max, the range is very wide.
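
The mean-versus-median argument above can be backed by a skewness statistic. A small sketch computing the Fisher-Pearson moment coefficient from first principles (the helper name `skewness_m` is ours; the loaded e1071 package offers an equivalent `skewness` function):

```r
# Moment-based skewness: positive values indicate a right-skewed distribution.
skewness_m <- function(x) {
  m <- mean(x)
  s <- sd(x)
  mean((x - m)^3) / s^3
}

# Applied to the raw target, e.g. skewness_m(transaction.data$target),
# this returns a clearly positive value, consistent with mean > median.
```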

Next, we dig deeper into the preliminary observations from the previous section. The broader hypothesis we analyze is that columns and rows were originally time-ordered and then shuffled for the competition. Excluding the ID and target columns leaves 4991 feature columns, and both dimensions divide evenly by seven:

num_rows / days_in_week : 4459 / 7 = 637

num_cols / days_in_week : 4991 / 7 = 713

This serves as an additional point in support of the hypothesis that the data represents weekly transactional activity. Based on other observations, this dataset does not seem to contain any aggregate features.
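
The divisibility claims above can be checked directly in R (a quick sanity check, not part of the original pipeline):

```r
# Both dimensions divide evenly by 7, consistent with the weekly-activity hypothesis.
n_rows <- 4459          # observations in train.csv
n_feature_cols <- 4991  # 4993 variables minus the ID and target columns
c(n_rows %% 7 == 0, n_feature_cols %% 7 == 0, n_rows / 7, n_feature_cols / 7)
```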

To prepare for modeling, we want to better understand the meaning of columns & rows. Further, we evaluate whether all the data is truly significant for our analysis. The key criterion is the number of zeros vs. the number of unique values.

tran.data.zero<-data.table(transaction.data)
n_zeros <- tran.data.zero[, lapply(.SD, function(x) sum(x == 0) / length(x))] %>% unlist
a <-list(
  autotick = FALSE,
  ticks = "outside",
  tick0 = 0.6,
  dtick = 0.1,
  range = c(0.6, 1),
  ticklen = 5,
  tickwidth = 2,
  tickcolor = toRGB("blue")
)
plot_ly(x = ~n_zeros, type = "histogram",
       marker = list(color = "dodgerblue")) %>% 
  layout(xaxis = a, title = "Histogram of % of zeros in dataset", titlefont = list(color = '#000000', size = 20), margin = list(l = 50, t=40))

Source: Kaggle

As a start, we select the subset of the training data where columns have more than 800 non-zero values and rows have more than 650.

x<-colSums(transaction.data != 0)
y<-colnames(transaction.data)
x_name<-"Count"
y_name<-"Col_name"
Train_nz<- data.frame(x, y)
colnames(Train_nz) <- c(x_name, y_name)
#Include columns with non_zero values greater than 800
Subset1<-Train_nz[Train_nz$Count>800,]
Subset1$Col_name<-as.character(Subset1$Col_name)
#head(Subset1$Col_name)
#str(Subset1$Col_name)
train_non_zero<-transaction.data[Subset1$Col_name]
#head(train_non_zero,3)

w<-rowSums(transaction.data != 0)
t<-rownames(transaction.data)
w_name<-"Count"
t_name<-"Row_name"
Train_nz2<- data.frame(w, t)
colnames(Train_nz2) <- c(w_name, t_name)
#head(Train_nz2)
#Include rows with non_zero values greater than 650
Subset1a<-Train_nz2[Train_nz2$Count>650,]
Subset1a$Row_name<-as.character(Subset1a$Row_name)
#head(Subset1a$Row_name)
#str(Subset1a$Row_name)
train_non_zero<-train_non_zero[Subset1a$Row_name,]
head(train_non_zero,3)
##            ID  target X20aa07010 X963a49cdc X68a945b18 X935ca66a9
## 6   002dbeb22 2800000   17020000   17020000   17020000   17020000
## 78  03ec1414d  200000     350000     350000     300000     300000
## 115 06cd03d09 3095200    4005500    3420000     167000    1109500
##     X8.61E.26 X26fc93eb7 X0572565c2 X66ace2992   fb49e4212 X6619d81fc
## 6    17020000   17020000   17020000   17020000 17020000.00   17020000
## 78     300000     350000     350000     350000   350000.00     350000
## 115   1100000    8260400      40000          0 16818666.66    3252500
##      aca228668 X6eef030c1  bc70cbc26 fc99f9426 X7e814a30d bd6da0cca
## 6   8891333.34       0.00 8891333.34  17020000 8891333.34  17020000
## 78   685000.00       0.00  685000.00    350000  685000.00    300000
## 115  819500.00 1859333.34 1614909.10  22134000  236857.14    500000
##     X1db387535   b43a7cfd5  X024c577b9 ea772e115 ad009c8b9 X2ec5b290f
## 6     17020000 17020000.00 17020000.00  17020000  17020000   17020000
## 78      350000   350000.00   350000.00    300000    300000     350000
## 115    2090000   282333.34  1554666.66         0   4240000    3012000
##     X0ff32eb98 X166008929 X58e056e12 X241f0f867 cbb673163 X1931ccfdd
## 6     17020000   17020000          0   17020000  17020000   17020000
## 78      350000     300000     350000     350000    300000     350000
## 115     547000     140000    3862000   10491200  24200000   14145000
##     f02ecb19c X58e2e02e6 X1fe5d56b9 X9fd594eec   fb0f5dbfe X91f701ba2
## 6    17020000          0   17020000   17000000         0.0   17020000
## 78     300000   17000000     300000    4000000    350000.0     350000
## 115    600000          0          0    7279000 111791333.3   25100000
##     ca2b906e8  X703885424 f97d9431e X939f628a7 X62fb56487 eeb9cd3aa
## 6    17020000 17020000.00  17020000 8891333.34   17020000   2800000
## 78     300000   350000.00    300000  685000.00     300000   5700000
## 115   6880000  1448888.88  25100000  371200.00    3420000   1060000
##      X324921c7b  X58232a6fb c0d2348b7 X491b9ee45 c8d582dd2 d6bb78916
## 6   17020000.00 17020000.00  17020000   17020000  17020000  17020000
## 78    350000.00   350000.00    300000     350000    300000    350000
## 115 11973333.34 10814666.66   2090000     450000    880000     42000
##     c10f31664 X4bcf15776 X70feb1494 adb64ff71 X11e12dbe8 X9de83dc23
## 6    17020000   17020000   17020000  17020000   17020000   17020000
## 78     300000     300000     350000    350000     300000     300000
## 115  38800000    5200000    2190000  25300000   30000000   25300000
##      X62e59a501 X15ace8c9f X5c6487af1 bb0ce54e9 f190486d6 f74e8f13d
## 6   17020000.00     556000 17020000.0  17020000         0  17020000
## 78    350000.00     350000   350000.0    300000         0    350000
## 115  5352666.66     972000   147090.9   4400000   3530500   2286500
##     X5d3b81ef8 X77deffdf0 c5a231d81 e176a204a  X1702b5bf0 a09a238d0
## 6   8891333.34   17020000  17020000  17020000 17020000.00  17020000
## 78   685000.00     300000    350000    350000   350000.00    300000
## 115  929600.00   26400000  11115200  30000000  3056888.88         0
##     X190db8488   c47340d97 X8781e4b91  ef30f6be5 X23310aa6f
## 6     17020000 17020000.00   17020000 8891333.34   17020000
## 78      350000   350000.00     300000  685000.00     350000
## 115     819750  2253333.34          0  660000.00   23985000
write.csv(train_non_zero, file = "train_non_zero.csv",row.names=FALSE)

This approach allows us to identify ~70 variables and ~140 observations that appear to be the most impactful for the target variable. We also added a column with the mean value for each row.

Proceeding further, more advanced algorithms could be used to detect patterns between columns and rows: for example, a mix of feature importance, sorting columns and rows by their sums of non-zero values, and correlation plus RMSE between columns.

We could also consider Principal Component Analysis to further group the variables.

train2<-subset(train_non_zero,select=-c(target,ID))
pc<-prcomp(train2)
summary(pc)
## Importance of components:
##                                   PC1            PC2            PC3
## Standard deviation     28452263.12503 13814069.95623 13658500.42837
## Proportion of Variance        0.20904        0.04928        0.04817
## Cumulative Proportion         0.20904        0.25831        0.30648
##                                   PC4            PC5            PC6
## Standard deviation     13495921.79137 13305484.60045 12829748.54822
## Proportion of Variance        0.04703        0.04571        0.04250
## Cumulative Proportion         0.35351        0.39923        0.44173
##                                   PC7            PC8            PC9
## Standard deviation     12653298.27604 12382544.67908 12003277.99093
## Proportion of Variance        0.04134        0.03959        0.03720
## Cumulative Proportion         0.48307        0.52267        0.55987
##                                  PC10           PC11           PC12
## Standard deviation     11789582.14638 11236817.00212 11118053.71473
## Proportion of Variance        0.03589        0.03260        0.03192
## Cumulative Proportion         0.59576        0.62836        0.66028
##                                  PC13           PC14           PC15
## Standard deviation     10837418.56124 10668063.29444 10382706.24548
## Proportion of Variance        0.03033        0.02939        0.02784
## Cumulative Proportion         0.69061        0.72000        0.74783
##                                 PC16          PC17          PC18
## Standard deviation     9939811.05391 9696207.04405 8555770.81417
## Proportion of Variance       0.02551       0.02428       0.01890
## Cumulative Proportion        0.77335       0.79762       0.81652
##                                 PC19          PC20          PC21
## Standard deviation     8114606.09850 7655532.37539 7177005.58347
## Proportion of Variance       0.01700       0.01513       0.01330
## Cumulative Proportion        0.83353       0.84866       0.86196
##                                 PC22          PC23          PC24
## Standard deviation     7092703.81095 6836254.75690 6436286.08787
## Proportion of Variance       0.01299       0.01207       0.01070
## Cumulative Proportion        0.87495       0.88702       0.89772
##                                 PC25          PC26          PC27
## Standard deviation     6066468.41110 5916894.32180 5844677.62758
## Proportion of Variance       0.00950       0.00904       0.00882
## Cumulative Proportion        0.90722       0.91626       0.92508
##                                 PC28          PC29          PC30
## Standard deviation     5658296.64025 5366417.45323 4979286.03273
## Proportion of Variance       0.00827       0.00744       0.00640
## Cumulative Proportion        0.93335       0.94078       0.94718
##                                 PC31          PC32          PC33
## Standard deviation     4270083.81550 4130348.39512 4000862.76699
## Proportion of Variance       0.00471       0.00441       0.00413
## Cumulative Proportion        0.95189       0.95630       0.96043
##                                 PC34          PC35          PC36
## Standard deviation     3889047.66908 3763794.43003 3556414.95474
## Proportion of Variance       0.00391       0.00366       0.00327
## Cumulative Proportion        0.96434       0.96799       0.97126
##                                 PC37          PC38          PC39
## Standard deviation     3487221.74456 3393276.63263 3109433.79942
## Proportion of Variance       0.00314       0.00297       0.00250
## Cumulative Proportion        0.97440       0.97737       0.97987
##                                 PC40          PC41          PC42
## Standard deviation     2971497.79532 2828930.86913 2685540.31369
## Proportion of Variance       0.00228       0.00207       0.00186
## Cumulative Proportion        0.98215       0.98422       0.98608
##                                 PC43          PC44          PC45
## Standard deviation     2523885.28039 2349239.93486 2293042.40406
## Proportion of Variance       0.00164       0.00143       0.00136
## Cumulative Proportion        0.98772       0.98915       0.99051
##                                 PC46          PC47          PC48
## Standard deviation     2183298.96044 2054172.21424 1982366.01076
## Proportion of Variance       0.00123       0.00109       0.00101
## Cumulative Proportion        0.99174       0.99283       0.99384
##                                 PC49          PC50          PC51
## Standard deviation     1852424.03085 1808851.76826 1647618.96710
## Proportion of Variance       0.00089       0.00084       0.00070
## Cumulative Proportion        0.99473       0.99557       0.99627
##                                 PC52          PC53          PC54
## Standard deviation     1565368.66280 1463802.14245 1417607.76641
## Proportion of Variance       0.00063       0.00055       0.00052
## Cumulative Proportion        0.99691       0.99746       0.99798
##                                 PC55          PC56          PC57
## Standard deviation     1259265.90789 1202521.97104 1055083.47765
## Proportion of Variance       0.00041       0.00037       0.00029
## Cumulative Proportion        0.99839       0.99876       0.99905
##                                PC58         PC59         PC60        PC61
## Standard deviation     910067.00528 861731.07000 766484.55635 611359.0109
## Proportion of Variance      0.00021      0.00019      0.00015      0.0001
## Cumulative Proportion       0.99926      0.99945      0.99961      0.9997
##                                PC62         PC63         PC64         PC65
## Standard deviation     516826.32229 468050.48849 456246.29974 432446.59388
## Proportion of Variance      0.00007      0.00006      0.00005      0.00005
## Cumulative Proportion       0.99977      0.99983      0.99988      0.99993
##                                PC66         PC67         PC68         PC69
## Standard deviation     345668.88231 259402.47042 224748.79172 178658.51501
## Proportion of Variance      0.00003      0.00002      0.00001      0.00001
## Cumulative Proportion       0.99996      0.99998      0.99999      1.00000
#plot(pc)
plot(pc,type="l")

set_plot_dimensions(1200, 1200)
biplot(pc, cex.lab=0.8, cex.axis=0.8, cex.main=0.7, cex.sub=0.5)

#attributes(pc)

The principal component analysis produced 69 components, which is nearly the total number of variables in our latest subset, so PCA offers little dimensionality reduction here. Therefore, we proceed with the train_non_zero subset.
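
Had we wanted to retain components rather than the raw subset, a common rule of thumb is to keep the smallest number of components whose cumulative proportion of variance crosses a chosen threshold. A sketch (the helper name `n_components` is ours; pass it the prcomp object fitted above, e.g. `n_components(pc)`):

```r
# Smallest number of principal components whose cumulative share
# of total variance reaches `threshold`.
n_components <- function(pca, threshold = 0.90) {
  var_explained <- pca$sdev^2 / sum(pca$sdev^2)
  which(cumsum(var_explained) >= threshold)[1]
}
```

Judging from the summary above, a 90% threshold would still retain 25 of the 69 components, another sign that no small set of directions dominates this data.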

Once the desired subset is selected, we analyze its structure and identify the necessary elements for the data preparation process.

str(train_non_zero, list.len = 10, vec.len = 5)
## 'data.frame':    141 obs. of  71 variables:
##  $ ID        : Factor w/ 4459 levels "000d6aaf2","000fbd867",..: 6 73 109 121 145 150 173 186 207 228 238 373 ...
##  $ target    : num  2800000 200000 3095200 10814667 200000 25100000 ...
##  $ X20aa07010: num  17020000 350000 4005500 3420000 747474 600000 ...
##  $ X963a49cdc: num  17020000 350000 3420000 25300000 747474 35050000 ...
##  $ X68a945b18: num  17020000 300000 167000 1109500 1330000 1800000 ...
##  $ X935ca66a9: num  17020000 300000 1109500 1800000 1330000 0 ...
##  $ X8.61E.26 : num  17020000 300000 1100000 4900000 1330000 715000 ...
##  $ X26fc93eb7: num  17020000 350000 8260400 2253333 747474 2186000 ...
##  $ X0572565c2: num  17020000 350000 40000 2300000 747474 0 ...
##  $ X66ace2992: num  17020000 350000 0 25100000 747474 430000 ...
##   [list output truncated]
summary.subset <- summary.data.frame(train_non_zero)
summary.subset[1:6, 1:10]
##          ID          target           X20aa07010          X963a49cdc      
##  002dbeb22:  1   Min.   :   30000   Min.   :        0   Min.   :       0  
##  03ec1414d:  1   1st Qu.:  380000   1st Qu.:   307000   1st Qu.:  338000  
##  06cd03d09:  1   Median : 1359000   Median :   834800   Median :  814800  
##  077787b5d:  1   Mean   : 4380924   Mean   :  4276344   Mean   : 3421982  
##  08c1f91b0:  1   3rd Qu.: 5187333   3rd Qu.:  3252500   3rd Qu.: 2186000  
##  08f544828:  1   Max.   :35050000   Max.   :111791333   Max.   :35050000  
##    X68a945b18         X935ca66a9           X8.61E.26       
##  Min.   :       0   Min.   :       0.0   Min.   :       0  
##  1st Qu.:   65000   1st Qu.:  140000.0   1st Qu.:   92000  
##  Median :  444000   Median :  747333.3   Median :  732000  
##  Mean   : 2992850   Mean   : 2827357.4   Mean   : 2533558  
##  3rd Qu.: 1775333   3rd Qu.: 1775333.3   3rd Qu.: 1775333  
##  Max.   :38800000   Max.   :30000000.0   Max.   :30000000  
##    X26fc93eb7         X0572565c2         X66ace2992      
##  Min.   :       0   Min.   :       0   Min.   :       0  
##  1st Qu.:  338000   1st Qu.:  350000   1st Qu.:  338000  
##  Median :  819750   Median :  834800   Median :  819750  
##  Mean   : 3410870   Mean   : 2996867   Mean   : 3518926  
##  3rd Qu.: 1933333   3rd Qu.: 1563412   3rd Qu.: 2253333  
##  Max.   :35050000   Max.   :35050000   Max.   :35050000
plot_histogram(train_non_zero)

plot_correlation(train_non_zero,type="continuous", theme_grey(base_size=5))

Data Preparation

Note: the provided test set does not contain the target variable, so our evaluation of the models will have to be based on the training set. However, we still apply all the necessary data preparation steps to the test set to showcase the process.

1. Validate the partition.

First, we create a similar subset from the test data, using exactly the same columns and similar non-zero rows.

#Compare test and train base files
comparison <- compare(transaction.data,test_non_zero_base,allowAll=TRUE)
comparison$tM
semi_join(transaction.data,test_non_zero_base)

It seems like there aren’t any common rows.

subset_colnames<-colnames(train_non_zero)
subset_ID<-as.character(train_non_zero$ID)
test_names<-names(test_non_zero_base)[names(test_non_zero_base) %in% subset_colnames]
test_ID<-test_non_zero_base$ID[test_non_zero_base$ID %in% subset_ID]
test_non_zero <-test_non_zero_base[, test_names]


z<-rowSums(test_non_zero_base != 0)
q<-rownames(test_non_zero_base)
z_name<-"Count"
q_name<-"Row_name"
Train_nz3<- data.frame(z, q)
colnames(Train_nz3) <- c(z_name, q_name)
#head(Train_nz3)
#Include rows with non_zero values greater than 950
Subset1b<-Train_nz3[Train_nz3$Count>950,]
Subset1b$Row_name<-as.character(Subset1b$Row_name)
#head(Subset1b$Row_name)
#str(Subset1b$Row_name)
test_row_names<-rownames(test_non_zero)[rownames(test_non_zero) %in% Subset1b$Row_name]
test_non_zero<-test_non_zero[test_row_names, ]
head(test_non_zero,3)
##             ID X20aa07010  X963a49cdc X68a945b18 X935ca66a9 X26fc93eb7
## 874  046868511          0        0.00      97000    1180000  136800.00
## 1369 07061afcb    8120000 20000000.00    1000000    2276500  340666.66
## 1584 08337f345     380000   110666.66     600000      68000  186000.00
##      X0572565c2 X66ace2992  fb49e4212 X6619d81fc aca228668 X6eef030c1
## 874     5510800   59333.34   78000.00 6060000.00 161619.04     186000
## 1369     186000  380000.00 8478666.66 3840000.00 151500.00    5345600
## 1584     656400 3840000.00 1137555.56 1427230.76 200000.00    4139600
##      bc70cbc26  fc99f9426 X7e814a30d bd6da0cca X1db387535   b43a7cfd5
## 874     918000 4609500.00  205600.00   1248000   39000.00  1427230.76
## 1369         0  211333.34  458000.00    116000 8583428.58  1372800.00
## 1584    450000  237000.00  519333.34         0  340000.00 20000000.00
##      X024c577b9 ea772e115 ad009c8b9 X2ec5b290f  X0ff32eb98 X166008929
## 874      221000         0  15800000     190800  1347166.66    7252000
## 1369   11700000         0    281000    2290800   221000.00     660000
## 1584     564800   4900000  11187200    1020500 15800000.00   15800000
##      X58e056e12 X241f0f867 cbb673163 X1931ccfdd f02ecb19c X58e2e02e6
## 874   115500.00 1200000.00         0       0.00         0 8583428.58
## 1369       0.00 9731666.66    800000  664000.00         0 2340400.00
## 1584 8478666.66  970500.00   2276500 1124166.66         0       0.00
##      X1fe5d56b9 X9fd594eec   fb0f5dbfe  X91f701ba2 ca2b906e8 X703885424
## 874       14000  110666.66 12686800.00 12307000.00   1720000    1000000
## 1369   25400000  369200.00  2600000.00   110666.66    111000    6206000
## 1584    2560000 6062500.00  9731666.66   383200.00    660000   12686800
##      f97d9431e X939f628a7 X62fb56487   eeb9cd3aa X324921c7b X58232a6fb
## 874    1280000   34666.66          0   211333.34     383200  340000.00
## 1369     80000  545333.34          0  5600000.00    6062500       0.00
## 1584    483000  132500.00      80000 52932000.00    6206000  340666.66
##      c0d2348b7 X491b9ee45 c8d582dd2 d6bb78916 c10f31664 X4bcf15776
## 874      39000 1554666.66         0    200000    268000   68000.00
## 1369  11187200 1020500.00  15010000  22529000         0  664666.66
## 1584         0       0.00   9307200   3761200         0  116000.00
##      X70feb1494 adb64ff71 X11e12dbe8 X9de83dc23 X62e59a501 X15ace8c9f
## 874   500000.00    911000          0    1424000  656400.00 1124166.66
## 1369  525066.66  12686800    2560000    9307200 4139600.00 3677000.00
## 1584  115500.00   1200000     200000    1200000  525066.66 2290800.00
##      X5c6487af1  bb0ce54e9  f190486d6   f74e8f13d X5d3b81ef8 X77deffdf0
## 874  2456800.00       0.00 3840000.00 15800000.00     200000    9106000
## 1369 3761200.00 4900000.00 6266666.66   564800.00     634000     600000
## 1584  211333.34   84666.66 1372800.00  8583428.58      10000          0
##      c5a231d81  e176a204a X1702b5bf0 a09a238d0 X190db8488 c47340d97
## 874   16000.00 3540000.00     237000  13100000  984000.00         0
## 1369      0.00 1137555.56   52932000         0 1124166.66    115500
## 1584  59333.34  200000.00     664000   1100000  190800.00     78000
##      X8781e4b91 ef30f6be5 X23310aa6f
## 874   800000.00   6000.00    1180000
## 1369 1650666.66 950333.34     970500
## 1584  290000.00 600000.00     221000
write.csv(test_non_zero, file = "test_non_zero.csv",row.names=FALSE)

It’s interesting that one column name from train_non_zero was not found in the initial test file. We would have to go back to the client to learn the reason: was it a typo, were some variables omitted on purpose, or something else?

#Compare test and train subsets
comparison <- compare(train_non_zero,test_non_zero,allowAll=TRUE)
comparison$tM
semi_join(train_non_zero,test_non_zero)

Next, we conduct several two-sample t-tests for a difference in means. The null hypothesis is that the means are equal and the partition is valid; the alternative hypothesis is that the means differ significantly and the partition is invalid. We use a significance level of 5%.
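
The manual computation below uses a conservative degrees-of-freedom choice, min(n1 − 1, n2 − 1). For reference, R’s built-in `t.test` performs the Welch version of the same test directly (a sketch with toy vectors, not the actual columns):

```r
# Welch two-sample t-test: unequal variances, Welch-Satterthwaite df.
x <- c(1.1, 2.0, 2.9, 4.2, 5.1)
y <- c(1.0, 2.1, 3.0, 4.0, 5.0)
tt <- t.test(x, y, alternative = "two.sided")
tt$p.value  # a valid two-sided p-value always lies in [0, 1]
```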

mean1<-mean(train_non_zero[,3])
mean2<-mean(test_non_zero[,2])
sd1<-sd(train_non_zero[,3])
sd2<-sd(test_non_zero[,2])
l1<-length(train_non_zero[,3])
l2<-length(test_non_zero[,2])
#Conservative degrees of freedom: the smaller sample size minus one
dfs <- min(l1 - 1, l2 - 1)
tdata <- (mean1 - mean2) / sqrt((sd1^2/l1)+(sd2^2/l2))
#Two-sided p-value: take |t| so the result lies in [0, 1]
pvalue <- 2*pt(abs(tdata), df = dfs, lower.tail=FALSE)
tdata; pvalue
## [1] -1.704320307
## [1] 0.091660198

Based on the test for the first predictor column, the p-value is higher than 0.05, so we don’t have enough evidence to reject the null hypothesis and the partition appears valid.

mean3<-mean(train_non_zero[,4])
mean4<-mean(test_non_zero[,3])
sd3<-sd(train_non_zero[,4])
sd4<-sd(test_non_zero[,3])
l3<-length(train_non_zero[,4])
l4<-length(test_non_zero[,3])
dfs <- min(l3 - 1, l4 - 1)
tdata1 <- (mean3 - mean4) / sqrt((sd3^2/l3)+(sd4^2/l4))
#Two-sided p-value: take |t| so the result lies in [0, 1]
pvalue1 <- 2*pt(abs(tdata1), df = dfs, lower.tail=FALSE)
tdata1; pvalue1
## [1] -2.157003084
## [1] 0.033583623

For the second variable, the p-value falls just below 0.05, so there is some evidence of a difference in means for that column. Given the conservative degrees of freedom and the exploratory nature of this check, we will still treat the partition as workable for the goals of the modeling, while noting this caveat.

2. Standardize and normalize variables

To start off, we standardize both data sets using the z-score method.

#Scale all numeric columns (z-score), then restore ID and the raw target
train_non_zero_scaled<-scale(train_non_zero[,-1])
train_non_zero_scaled<-data.frame(train_non_zero_scaled)
train_non_zero_scaled$ID<-train_non_zero$ID 
#Move ID (column 71) back to the front
train_non_zero_scaled<-train_non_zero_scaled[c(71,1:70)]
train_non_zero_scaled$target<-train_non_zero$target

#Plot correlation
plot_correlation(train_non_zero_scaled,type="continuous")

pairs(train_non_zero_scaled[2:10])

test_non_zero_scaled<-scale(test_non_zero[,-1])
test_non_zero_scaled<-data.frame(test_non_zero_scaled)
test_non_zero_scaled$ID<-test_non_zero$ID 
test_non_zero_scaled<-test_non_zero_scaled[c(69,1:68)]

#Plot correlation
#plot_correlation(test_non_zero_scaled,type="continuous")
#pairs(test_non_zero_scaled[2:10])

Next, we analyze and, if necessary, remove outliers.

outliers <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>% 
    map(~ boxplot.stats(.x)$out) 
}
head(outliers(train_non_zero_scaled), 5)
## $target
##  [1] 25100000.00 23985000.00 17000000.00 12666666.66 16818666.66
##  [6] 30000000.00 35050000.00 13263000.00 20000000.00 14145000.00
## [11] 17000000.00 14000000.00 16733333.34 22134000.00 25300000.00
## [16] 17000000.00
## 
## $X20aa07010
##  [1] 1.1573695130 0.8962641201 0.6210999047 0.6990349457 1.1573695130
##  [6] 1.1573695130 1.7899257083 2.7948408987 1.1555531277 1.1313346570
## [11] 0.3490780359 1.8911891911 0.5938056870 1.6218192449 9.7644325446
## [16] 0.5644286820 1.9093530445 0.3618290610 1.1390845666 1.1573695130
## [21] 0.7014567922
## 
## $X963a49cdc
##  [1] 2.0828594576 3.3511381057 4.8445821659 1.1783990999 3.1497146247
##  [6] 2.0828594576 2.0828594576 3.3205033557 0.3107157596 2.0828594576
## [11] 0.7411186791 0.7196130846 1.3139272337 0.2957302601 1.0828186801
## [16] 4.0710547296 2.0389496503 2.8661900139 1.3098426014 2.0520204749
## [21] 2.0828594576 1.6424849270
## 
## $X68a945b18
##  [1] 2.2454253992 0.3052909793 0.9776147882 2.2454253992 2.2454253992
##  [6] 3.7469485724 2.2454253992 0.5665368022 0.2252524306 3.5388483458
## [11] 2.3862932449 0.3533141085 0.6222436320 1.3988576698 3.2987326998
## [16] 0.3110537548 0.2252524306 3.5708637653 1.0576533368 1.7129022564
## [21] 2.2454253992 0.7531066591 5.7319045796
## 
## $X935ca66a9
##  [1] 2.3969579776 0.3500431354 3.9811214445 1.0593700609 2.3969579776
##  [6] 2.3969579776 2.3969579776 3.7953453450 0.8225055340 0.6256673122
## [11] 3.6095692455 0.6844401146 0.2655994538 3.5082368275 4.5891159521
## [16] 3.7615678724 1.1438137425 0.2655994538 0.3561230805 2.3969579776

Looking at the identified outliers, we do not possess enough domain and client knowledge at this point to justify removing them. Therefore, we will proceed to modelling with the full standardized data set and, if necessary, make iterative adjustments based on the learnings of the next phase.
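For context on how the outliers above were identified: boxplot.stats() applies Tukey's rule, flagging points beyond 1.5 times the IQR from the hinges, so the listed values are distribution-tail points rather than confirmed errors. A toy illustration:

```r
# Tukey's rule: points beyond hinge +/- 1.5 * IQR are reported as outliers
x  <- c(1:10, 100)
st <- boxplot.stats(x)
st$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
st$out     # 100 -- the only point outside the fences
```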

Finally, we transform the target variable to improve its normality for modelling.

summary(train_non_zero_scaled$target)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    30000   380000  1359000  4380924  5187333 35050000
train_non_zero_scaled$log_target<-log(train_non_zero_scaled$target)
train_non_zero_scaled<-train_non_zero_scaled[c(1:2,72,3:71)]
skew(train_non_zero_scaled$log_target)
## [1] -0.04391634008
ggplot(train_non_zero_scaled,aes(x=log_target))+geom_histogram(fill="blue")+ggtitle("Histogram of Normalized Target")

qqnorm(train_non_zero_scaled$log_target,
      datax = TRUE,
      col = "red",
      main = "Normal Q-Q Plot of log_target Distribution")
qqline(train_non_zero_scaled$log_target,
      col = "blue",
      datax = TRUE)

plot(rowSums(train_non_zero_scaled[,4:43]), train_non_zero_scaled$log_target, main="Scatterplot of log_target vs row sums", xlab="Row Sums", ylab="Log of Target", pch=16)

Based on the learnings from modelling, we might also consider normalizing predictor variables. However, we need to keep in mind the negative impact it would have on the ability to interpret the results of, for example, linear regression. Using base dollar values makes it much easier to understand the nature of the effect that predictors have on the target.
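A brief sketch of why the log transform helps, on simulated lognormal "dollar" values (illustrative data, not the actual target): the raw values are heavily right-skewed while their log is close to symmetric, and exp() maps log-scale predictions back to dollars.

```r
# Moment-based sample skewness (same idea as the psych::skew call above)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(7)
amounts <- rlnorm(1000, meanlog = 14, sdlog = 1.2)  # right-skewed "dollar" values

skewness(amounts)        # strongly positive
skewness(log(amounts))   # near zero

# Predictions made on the log scale are mapped back with exp()
all.equal(exp(log(amounts)), amounts)   # TRUE
```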

3. Bin variables

For some of the numeric predictors, we see the opportunity to apply equal frequency binning.

#Binning one of the predictors
N1 <- length(train_non_zero_scaled[,4])
nbins1 <- 5
whichbin1 <- c(rep(0, N1))
freq1 <- N1/nbins1
train_non_zero_scaled <- train_non_zero_scaled[order(train_non_zero_scaled[,4]),]
for (i in 1:nbins1) {
  for (j in 1:N1) {
    if((i-1)*freq1 < j && j <=i*freq1)
      whichbin1[j] <- i
  }
}
whichbin1<-gsub(pattern = "1", replacement = "VLow", whichbin1)
whichbin1<-gsub(pattern = "2", replacement = "Low", whichbin1)
whichbin1<-gsub(pattern = "3", replacement = "Medium", whichbin1)
whichbin1<-gsub(pattern = "4", replacement = "High", whichbin1)
whichbin1<-gsub(pattern = "5", replacement = "VHigh", whichbin1)
train_non_zero_scaled[,4]<-whichbin1


#plot frequencies in the bins
barplot(table(train_non_zero_scaled[,4]))

N2 <- length(test_non_zero_scaled[,2])
nbins2 <- 5
whichbin2 <- c(rep(0, N2))
freq2 <- N2/nbins2
test_non_zero_scaled <- test_non_zero_scaled[order(test_non_zero_scaled[,2]),]
for (i in 1:nbins2) {
  for (j in 1:N2) {
    if((i-1)*freq2 < j && j <=i*freq2)
      whichbin2[j] <- i
  }
}
whichbin2<-gsub(pattern = "1", replacement = "VLow", whichbin2)
whichbin2<-gsub(pattern = "2", replacement = "Low", whichbin2)
whichbin2<-gsub(pattern = "3", replacement = "Medium", whichbin2)
whichbin2<-gsub(pattern = "4", replacement = "High", whichbin2)
whichbin2<-gsub(pattern = "5", replacement = "VHigh", whichbin2)
test_non_zero_scaled[,2]<-whichbin2

#plot frequencies in the bins
barplot(table(test_non_zero_scaled[,2]))

write.csv(train_non_zero_scaled, file = "train_non_zero_scaled.csv",row.names=FALSE)

If necessary, we might want to consider binning based on predictive value or clarifying the best cut-off numbers with the client experts.
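For what it's worth, the same equal-frequency bins can be assigned in one pass from the ranks, avoiding the nested loop and the re-sort; a base-R sketch on a hypothetical vector x:

```r
# Equal-frequency binning via ranks: no explicit loops, no reordering
equal_freq_bin <- function(x, labels = c("VLow", "Low", "Medium", "High", "VHigh")) {
  nbins <- length(labels)
  r <- rank(x, ties.method = "first")      # 1..N, ties broken by position
  labels[ceiling(nbins * r / length(x))]   # rank quantile -> bin label
}

x <- rnorm(100)
table(equal_freq_bin(x))   # 20 observations per bin
```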

Also, as we plan to apply the classification decision tree, we have decided to try equal frequency binning for the target variable.

#Binning Target Variable
Nt <- length(train_non_zero_scaled$log_target)
nbinst <- 5
whichbint <- c(rep(0, Nt))
freqt <- Nt/nbinst
train_non_zero_scaled_sorted <- train_non_zero_scaled[order(train_non_zero_scaled$log_target),]
for (i in 1:nbinst) {
  for (j in 1:Nt) {
    if((i-1)*freqt < j && j <=i*freqt)
      whichbint[j] <- i
  }
}
whichbint1<-gsub(pattern = "1", replacement = "VLow", whichbint)
whichbint2<-gsub(pattern = "2", replacement = "Low", whichbint1)
whichbint3<-gsub(pattern = "3", replacement = "Medium", whichbint2)
whichbint4<-gsub(pattern = "4", replacement = "High", whichbint3)
whichbint5<-gsub(pattern = "5", replacement = "VHigh", whichbint4)
train_non_zero_scaled_sorted$bin_target<-whichbint5
train_non_zero_scaled_sorted<-train_non_zero_scaled_sorted[c(1:3,73,4:72)]

#plot frequencies in the bins
barplot(table(train_non_zero_scaled_sorted$bin_target))

Modeling, Evaluation, and Reporting

Build several models and compare results

a. Clustering

We start with unsupervised clustering to see if we can identify any patterns of similarity across data rows. We apply hierarchical agglomerative clustering using default complete linkage.

##calculate distance matrix (default is Euclidean distance)
distance = dist(train_non_zero_scaled[,5:72])

# Hierarchical agglomerative clustering using default complete linkage 
train.hclust = hclust(distance)
#set_plot_dimensions(1600, 1200)
plot(train.hclust, cex=0.6)

member = cutree(train.hclust,3)
table(member)
## member
##   1   2   3 
## 128   8   5
##calculate the same for the test subset
distance2 = dist(test_non_zero_scaled[,3:69])
train.hclust2 = hclust(distance2)
member2 = cutree(train.hclust2,3)
table(member2)
## member2
##  1  2  3 
## 89  3  2

Clustering analysis on the entire standardized training subset did not yield any significant results; around 91% of the training records (128 of 141) and 95% of the test records (89 of 94) fall into a single cluster. Therefore, to maintain consistency in evaluating the models, we will proceed with treating each test subset record individually as well.
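For reference, the dist()/hclust()/cutree() pipeline above does recover structure when it clearly exists; on simulated data with two well-separated groups, complete linkage splits them exactly:

```r
set.seed(1)
# Two groups of 10 points each, separated by ~5 standard deviations
g1 <- matrix(rnorm(20, mean = 0), ncol = 2)
g2 <- matrix(rnorm(20, mean = 5), ncol = 2)
m  <- rbind(g1, g2)

hc     <- hclust(dist(m))   # Euclidean distance, complete linkage
member <- cutree(hc, k = 2)
table(member)               # 10 points in each cluster
```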

b. Linear regression

Next, we proceed to supervised methods to achieve the goal of predicting future customer transactions.

Given the disguised nature of the columns, we start by applying stepwise linear regression in both directions on the whole standardized training subset with the normalized target variable.

#*use train_non_zero_scaled on log_target*
full.model<-lm(log_target ~., data = train_non_zero_scaled[,3:72])
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)
summary(step.model)
## 
## Call:
## lm(formula = log_target ~ X963a49cdc + X68a945b18 + X935ca66a9 + 
##     X8.61E.26 + X26fc93eb7 + X66ace2992 + fb49e4212 + X6619d81fc + 
##     X1db387535 + ea772e115 + X2ec5b290f + X58e056e12 + X58e2e02e6 + 
##     eeb9cd3aa + X491b9ee45 + c10f31664 + X4bcf15776 + adb64ff71 + 
##     X11e12dbe8 + X62e59a501 + X15ace8c9f + X5c6487af1 + bb0ce54e9 + 
##     X5d3b81ef8 + X8781e4b91 + fc99f9426 + f97d9431e, data = train_non_zero_scaled[, 
##     3:72])
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.9690287 -0.6342392 -0.1149387  0.6987682  3.1207249 
## 
## Coefficients:
##               Estimate Std. Error   t value               Pr(>|t|)    
## (Intercept) 14.0764960  0.1089592 129.19055 < 0.000000000000000222 ***
## X963a49cdc   0.4766312  0.1666008   2.86092             0.00503360 ** 
## X68a945b18  -0.5251490  0.1996859  -2.62987             0.00973174 ** 
## X935ca66a9  -0.4812859  0.1660843  -2.89784             0.00451396 ** 
## X8.61E.26   -1.0638425  0.2233211  -4.76374         0.000005689103 ***
## X26fc93eb7   0.9323432  0.2246230   4.15070         0.000064527305 ***
## X66ace2992   1.1117268  0.2029142   5.47880         0.000000262849 ***
## fb49e4212    0.4686335  0.1552144   3.01927             0.00313281 ** 
## X6619d81fc   0.4895387  0.2255512   2.17041             0.03206848 *  
## X1db387535   0.2922852  0.1676668   1.74325             0.08400983 .  
## ea772e115   -0.8026186  0.2147010  -3.73831             0.00029241 ***
## X2ec5b290f   0.5282731  0.1865555   2.83172             0.00548272 ** 
## X58e056e12   0.5766069  0.1568835   3.67538             0.00036482 ***
## X58e2e02e6  -0.3287354  0.1403697  -2.34193             0.02093592 *  
## eeb9cd3aa   -0.2783087  0.1537363  -1.81030             0.07290679 .  
## X491b9ee45   0.6976034  0.1705011   4.09149         0.000080672794 ***
## c10f31664   -1.4584060  0.2393897  -6.09218         0.000000015779 ***
## X4bcf15776  -0.2945194  0.2009240  -1.46582             0.14547315    
## adb64ff71    0.6450074  0.2364863   2.72746             0.00740176 ** 
## X11e12dbe8   0.9890311  0.1798931   5.49788         0.000000241376 ***
## X62e59a501  -0.3518579  0.1589726  -2.21332             0.02888439 *  
## X15ace8c9f   0.2469232  0.1477093   1.67168             0.09735379 .  
## X5c6487af1  -0.3712544  0.1719917  -2.15856             0.03299972 *  
## bb0ce54e9   -0.8458740  0.2385202  -3.54634             0.00056971 ***
## X5d3b81ef8   0.5875417  0.1119477   5.24836         0.000000726627 ***
## X8781e4b91  -0.5330379  0.1995699  -2.67093             0.00868073 ** 
## fc99f9426    0.5712708  0.1926983   2.96459             0.00369769 ** 
## f97d9431e   -0.2651615  0.1902237  -1.39395             0.16606916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.293818 on 113 degrees of freedom
## Multiple R-squared:  0.5489271,  Adjusted R-squared:  0.4411486 
## F-statistic: 5.093105 on 27 and 113 DF,  p-value: 0.0000000003662682

The final model is significant as a whole, but explains only about 55% of the variation in the data (44% after adjusting for the number of predictors).

#Evaluate using the same training set
train_non_zero_scaled$pred_target_value_reg<-exp(predict(step.model,train_non_zero_scaled))
(rmse1a<-RMSE(train_non_zero_scaled$target, train_non_zero_scaled$pred_target_value_reg))
## [1] 5472996.278
(rmsle1a<-rmsle(preds = as.numeric(train_non_zero_scaled$pred_target_value_reg), actuals = as.numeric(train_non_zero_scaled$target)))
## [1] 1.158248888
par(mfrow = c(2, 2))
plot(step.model)

#test_non_zero_scaled$pred_target_value_reg<-exp(predict(step.model,test_non_zero_scaled))
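Since RMSE and RMSLE drive the model comparison below, the following base-R sketch spells out what we assume those calls compute (the rmsle(preds, actuals) signature suggests the mltools definition; log1p guards against zero values):

```r
# Standard definitions sketched in base R; log1p(x) = log(1 + x)
rmse_manual  <- function(pred, actual) sqrt(mean((pred - actual)^2))
rmsle_manual <- function(pred, actual) sqrt(mean((log1p(pred) - log1p(actual))^2))

# Hypothetical dollar-valued predictions against actuals
actual <- c(30000, 380000, 1359000, 5187333)
pred   <- c(45000, 300000, 2000000, 4000000)

rmse_manual(pred, actual)    # in dollars -- dominated by the largest errors
rmsle_manual(pred, actual)   # scale-free -- penalizes relative error
```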

c. Decision Tree

  • On continuous variable
#*use train_non_zero_scaled on log_target*
tree1<-rpart(log_target ~ ., data=train_non_zero_scaled[,3:72], method="anova", model=TRUE)
#summary(tree1)
tmp<-printcp(tree1)
## 
## Regression tree:
## rpart(formula = log_target ~ ., data = train_non_zero_scaled[, 
##     3:72], method = "anova", model = TRUE)
## 
## Variables actually used in tree construction:
##  [1] bb0ce54e9  c0d2348b7  cbb673163  ef30f6be5  fb49e4212  X166008929
##  [7] X1702b5bf0 X2ec5b290f X58e2e02e6 X6619d81fc X6eef030c1
## 
## Root node error: 419.35171/141 = 2.9741256
## 
## n= 141 
## 
##             CP nsplit  rel error     xerror        xstd
## 1  0.181565504      0 1.00000000 1.00383681 0.087798761
## 2  0.143087328      1 0.81843450 1.02562438 0.089021190
## 3  0.077285045      2 0.67534717 0.92000939 0.104195728
## 4  0.058790530      3 0.59806212 0.92807564 0.108825860
## 5  0.037491574      4 0.53927159 0.95356066 0.112960416
## 6  0.036841659      5 0.50178002 0.94590377 0.115178626
## 7  0.032574586      7 0.42809670 0.92121313 0.112424997
## 8  0.021829320      8 0.39552212 0.93655612 0.112242351
## 9  0.018115251      9 0.37369280 0.95756615 0.118192716
## 10 0.014100675     10 0.35557754 0.98509559 0.122510382
## 11 0.010152555     11 0.34147687 1.02150248 0.124023075
## 12 0.010000000     12 0.33132431 1.01876079 0.124058802
(rsq.val <- 1-tmp[,3])
##            1            2            3            4            5 
## 0.0000000000 0.1815655038 0.3246528321 0.4019378770 0.4607284070 
##            6            7            8            9           10 
## 0.4982199805 0.5719032987 0.6044778847 0.6263072043 0.6444224554 
##           11           12 
## 0.6585231306 0.6686756852
set_plot_dimensions(1200, 1200)
#plot(tree1, uniform=TRUE, main="Regression Tree for Continuous Normal Target ", margin=0.05)
#text(tree1, use.n=TRUE, all=TRUE, cex=.6)
prp(tree1)

After the twelfth split, this regression decision tree explains around 67% of the variation in the training data. Note, however, that the cross-validated error (xerror) bottoms out after only two to three splits and rises thereafter, so the deeper splits likely overfit.
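The cp table printed above also supports pruning: rel error gives the in-sample fit (so 1 - rel error approximates R-squared), while xerror is the cross-validated estimate. A sketch of pruning at the cp with the lowest xerror, on simulated data:

```r
library(rpart)

set.seed(3)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 + sin(3 * df$x2) + rnorm(n, sd = 0.3)

fit <- rpart(y ~ ., data = df, method = "anova")
cp  <- fit$cptable

# In-sample R^2 at each tree size, as computed from printcp() above
rsq <- 1 - cp[, "rel error"]

# Prune at the subtree with the lowest cross-validated error
best   <- cp[which.min(cp[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)
```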

  • On binned variable
#*use train_non_zero_scaled_sorted on bin_target*
tree2<-rpart(bin_target ~ ., data=train_non_zero_scaled_sorted[,4:73], method="class",model=TRUE)
#summary(tree2)
printcp(tree2)
## 
## Classification tree:
## rpart(formula = bin_target ~ ., data = train_non_zero_scaled_sorted[, 
##     4:73], method = "class", model = TRUE)
## 
## Variables actually used in tree construction:
## [1] adb64ff71  cbb673163  ef30f6be5  fb49e4212  X190db8488 X2ec5b290f
## [7] X5d3b81ef8
## 
## Root node error: 112/141 = 0.79432624
## 
## n= 141 
## 
##            CP nsplit  rel error     xerror        xstd
## 1 0.107142857      0 1.00000000 1.16964286 0.027214983
## 2 0.098214286      1 0.89285714 1.13392857 0.031705770
## 3 0.071428571      2 0.79464286 1.03571429 0.040492204
## 4 0.062500000      3 0.72321429 0.94642857 0.045799356
## 5 0.040178571      4 0.66071429 0.87500000 0.048811223
## 6 0.026785714      6 0.58035714 0.83928571 0.049978737
## 7 0.017857143      8 0.52678571 0.86607143 0.049122971
## 8 0.010000000      9 0.50892857 0.86607143 0.049122971
#plot(tree2, uniform=TRUE, main="Classification Tree for Categorical Target ", margin=0.05)
#text(tree2, use.n=TRUE, all=TRUE, cex=.65)
prp(tree2)

#evaluating the first tree model (continuous) using the same training set
train_non_zero_scaled$pred_target_value_tree1<-exp(predict(tree1, train_non_zero_scaled))
(rmse2a<-RMSE(train_non_zero_scaled$target, train_non_zero_scaled$pred_target_value_tree1))
## [1] 5265257.901
(rmsle2a<-rmsle(preds = as.numeric(train_non_zero_scaled$pred_target_value_tree1), actuals = as.numeric(train_non_zero_scaled$target)))
## [1] 0.9926713606
#evaluating the second tree model (categorical) using the same training set
train_non_zero_scaled_sorted$pred_target_value_tree2<-predict(tree2, train_non_zero_scaled_sorted, type = "class")
confusionMatrix(train_non_zero_scaled_sorted$pred_target_value_tree2,as.factor(train_non_zero_scaled_sorted$bin_target), dnn = c("Prediction", "True Value"))
## Confusion Matrix and Statistics
## 
##           True Value
## Prediction High Low Medium VHigh VLow
##     High     14   2      5     3    1
##     Low       1  11      2     1    0
##     Medium    4   4     16     2    1
##     VHigh     6   4      1    21    4
##     VLow      3   7      4     2   22
## 
## Overall Statistics
##                                                 
##                Accuracy : 0.5957447             
##                  95% CI : (0.5098909, 0.6774884)
##     No Information Rate : 0.2056738             
##     P-Value [Acc > NIR] : < 0.00000000000000022 
##                                                 
##                   Kappa : 0.4944329             
##  Mcnemar's Test P-Value : 0.1429546             
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium Class: VHigh
## Sensitivity           0.50000000 0.39285714     0.5714286    0.7241379
## Specificity           0.90265487 0.96460177     0.9026549    0.8660714
## Pos Pred Value        0.56000000 0.73333333     0.5925926    0.5833333
## Neg Pred Value        0.87931034 0.86507937     0.8947368    0.9238095
## Prevalence            0.19858156 0.19858156     0.1985816    0.2056738
## Detection Rate        0.09929078 0.07801418     0.1134752    0.1489362
## Detection Prevalence  0.17730496 0.10638298     0.1914894    0.2553191
## Balanced Accuracy     0.70132743 0.67872946     0.7370417    0.7951047
##                      Class: VLow
## Sensitivity            0.7857143
## Specificity            0.8584071
## Pos Pred Value         0.5789474
## Neg Pred Value         0.9417476
## Prevalence             0.1985816
## Detection Rate         0.1560284
## Detection Prevalence   0.2695035
## Balanced Accuracy      0.8220607
#test_non_zero_scaled$pred_target_value_tree<-exp(predict(tree1,test_non_zero_scaled))

d. Association Rules

To start off with this approach, we create a subset of train_non_zero that includes only some of the variables marked as significant during the regression analysis. After making every column in the subset categorical, we proceed with generating association rules for the variable “target” at 3.5% support and a 60% confidence level.

#*use train_non_zero_scaled_sorted_ruled on target*
train_non_zero_scaled_sorted_ruled <- as(train_non_zero_scaled_sorted_ruled, "transactions")
target_rules <- apriori(data=train_non_zero_scaled_sorted_ruled, parameter=list (supp=0.035,conf = 0.6, minlen=3, maxlen=5), appearance = list (rhs=c("target=VLow", "target=Low", "target=Medium", "target=High", "target=VHigh")))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.035      3
##  maxlen target   ext
##       5  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4 
## 
## set item appearances ...[5 item(s)] done [0.00s].
## set transactions ...[35 item(s), 141 transaction(s)] done [0.00s].
## sorting and recoding items ... [35 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [58 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(target_rules[1:10])
##      lhs                         rhs                 support   confidence        lift count
## [1]  {c10f31664=Above Medium,                                                              
##       b43a7cfd5=Low}          => {target=VLow} 0.04964539007 0.6363636364 3.204545455     7
## [2]  {X20aa07010=Low,                                                                      
##       c10f31664=Above Medium} => {target=VLow} 0.03546099291 0.6250000000 3.147321429     5
## [3]  {X20aa07010=VLow,                                                                     
##       c10f31664=Above Medium} => {target=VLow} 0.04255319149 0.6000000000 3.021428571     6
## [4]  {X20aa07010=VLow,                                                                     
##       ea772e115=Medium}       => {target=VLow} 0.04255319149 0.6666666667 3.357142857     6
## [5]  {X26fc93eb7=Low,                                                                      
##       c10f31664=Above Medium} => {target=VLow} 0.05673758865 0.6666666667 3.357142857     8
## [6]  {c10f31664=Above Medium,                                                              
##       X024c577b9=VLow}        => {target=VLow} 0.04964539007 0.7777777778 3.916666667     7
## [7]  {X66ace2992=Low,                                                                      
##       c10f31664=Above Medium} => {target=VLow} 0.07092198582 0.6666666667 3.357142857    10
## [8]  {c10f31664=Above Medium,                                                              
##       ea772e115=Medium}       => {target=VLow} 0.07801418440 0.6470588235 3.258403361    11
## [9]  {X20aa07010=Medium,                                                                   
##       b43a7cfd5=Above Medium,                                                              
##       X024c577b9=Medium}      => {target=Low}  0.03546099291 0.7142857143 3.596938776     5
## [10] {X20aa07010=Low,                                                                      
##       c10f31664=Above Medium,                                                              
##       b43a7cfd5=Low}          => {target=VLow} 0.03546099291 0.7142857143 3.596938776     5
target_rules<-sort(target_rules, by="confidence", decreasing=TRUE)

set_plot_dimensions(1200, 1200)
plot(target_rules[1:10], measure = "support", method="graph", shading="confidence", cex=0.7)
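Because apriori() only accepts categorical items, every numeric column had to be discretized before the coercion to "transactions" above. A base-R sketch of that kind of conversion, on a hypothetical column with quartile boundaries:

```r
# Discretize a numeric column into labeled quartile bins before rule mining
set.seed(11)
x <- rlnorm(141, meanlog = 14)   # stand-in for one predictor column

q    <- quantile(x, probs = seq(0, 1, 0.25))
bins <- cut(x, breaks = q, labels = c("Low", "Medium", "High", "VHigh"),
            include.lowest = TRUE)
table(bins)   # roughly equal counts per quartile bin
```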

e. Neural Networks

Performed with the help of the Rattle package, using 15 hidden nodes (Rattle’s nnet backend fits a single hidden layer).
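Rattle’s neural network backend is the nnet package; a hedged sketch of an equivalent direct call with 15 hidden units (linout = TRUE for regression on the log target; simulated data for illustration):

```r
library(nnet)

set.seed(5)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$log_target <- 14 + 0.5 * df$x1 - 0.3 * df$x2 + rnorm(n, sd = 0.5)

# One hidden layer with 15 units; linout = TRUE gives a linear output for regression
fit  <- nnet(log_target ~ ., data = df, size = 15, linout = TRUE,
             decay = 0.01, maxit = 500, trace = FALSE)
pred <- predict(fit, df)

# Pseudo R-squared on the training data, comparable to the other models
1 - sum((df$log_target - pred)^2) / sum((df$log_target - mean(df$log_target))^2)
```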

f. Random Forest

set.seed(12)
train_non_zero_scaled=train_non_zero_scaled %>% mutate_if(is.character, as.factor)
target.forest=randomForest(log_target~.,data=train_non_zero_scaled[,3:72],mtry=15,importance=TRUE,ntree=250)
target.forest
## 
## Call:
##  randomForest(formula = log_target ~ ., data = train_non_zero_scaled[,      3:72], mtry = 15, importance = TRUE, ntree = 250) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 15
## 
##           Mean of squared residuals: 2.357481635
##                     % Var explained: 20.73
varImpPlot(target.forest, cex=0.5)

train_non_zero_scaled$pred_target_value_forest<-exp(predict(target.forest,train_non_zero_scaled))
(rmse3a<-RMSE(train_non_zero_scaled$target, train_non_zero_scaled$pred_target_value_forest))
## [1] 4683592.621
(rmsle3a<-rmsle(preds = as.numeric(train_non_zero_scaled$pred_target_value_forest), actuals = as.numeric(train_non_zero_scaled$target)))
## [1] 0.6434032396

Project Deployment

Findings Summary

To come up with recommendations for deployment, we first look at the statistical summary of model performance. Since the project objective is predicting the actual dollar value of a future transaction, we will focus on the three supervised methods: linear regression, the regression decision tree, and the random forest. We will use insights from the other models to enrich our proposal.

Models Evaluation Summary

Metric                      Linear Regression   Regression Decision Tree   Random Forest
R-squared                   55%                 67%                        20%
RMSE (same training set)    $5,472,996.28       $5,265,257.90              $4,683,592.62
RMSLE (same training set)   1.16                0.99                       0.64

It’s worth reiterating that these results are only estimates, as the target variable for the test set wasn’t provided. Running the same models on a different subset of the training file produced drastically worse numbers, which could mean two things:

  1. The time series nature of the data makes it invalid to carve comparable subsets out of the same file by different criteria.

  2. The models are not yet trained well enough to be useful in practice.

We would need more information from the client to determine specific reasons.

As for other models, the classification tree achieved 60% accuracy and could be used to further analyze the dependencies among predictors and between predictors and the target variable. Yet to improve its power, we need to reevaluate the binning technique applied to both the feature and the target column, based on more accurate data & domain understanding.

Clustering did not yield any significant results on either the training or the test subset. Depending on feedback from the client, we might revisit this unsupervised method along with principal component analysis, which could help us narrow the focus to an even smaller group of the most significant features. We would then need to recheck the partition validation.

Association rules provided valuable insights: multiple complex rules (3 to 5 items) with the specified 3.5% support and 60% confidence were found just within the subset of variables. This reinforces the hypothesis that a smaller number of truly important features affect the target variable. However, the identical support values shared by many of the top 10 rules raise concerns that require further investigation. They may stem from the equal frequency binning, so to improve the reliability of the insights from association rules, we will likely need to consider binning based on the target or on client-defined boundary values.

The neural network with 15 hidden nodes fits the data quite well, as the respective graph shows, but its pseudo R-squared is the lowest across the models. The network’s black box nature makes it harder to analyze the reasons for this performance, so going forward we will need to reconsider the inputs to this technique.

Recommendations

As we can see, the random forest does better on every metric except the percentage of variation explained. Even so, none of the metrics is as strong as desired.

Overall, given the volatile nature of the anonymized data, the flexibility and ensemble averaging of the random forest make it the model of choice in our recommendation. Analysts shouldn’t depend on a single data mining method, but should instead seek a confluence of results from different models; that will make the conclusions more robust.

An important observation is the small number of variables that appear to actually matter. The regression output shows only a couple dozen variables with significant individual p-values; the decision trees stop after roughly 12 splits; and the association rules identify numerous strong relationships between that subset of regression variables and the target. Analyzing this further may be the key to separating genuine time series features from lag variables or pure noise.

Steps going forward to improve the predictive power:

  • First, to support the iterative nature of the CRISP-DM process, we will request feedback on the variables and the domains they were derived from.

  • Second, we will evaluate the actual test RMSLE with the current modeling.

  • Based on the findings, we will pin down the time series structure of the data.

  • After that, we adjust our data preparation methods and revisit clustering and PCA to improve the training set for the models.

  • So far, only basic transformation methods have been used. We will also need to consider functional forms that capture more complex relationships between predictors and the target (different powers, inverse relationships, etc.).

  • Finally, we will tune the models, prioritizing random forest, on the updated training set and reapply the predictions to the test set. We might also take a closer look at the support for association rules after enhanced discretizing.

Additional research questions:

  • Once we better understand the nature of the data, we would look at the potentially missed factors that can influence customer decisions.

  • Once satisfactory model results are reached, the analysis of existing customer communication methods needs to be conducted to determine the most appropriate channels that would use the project findings.